Data in pandas can be combined in two principal ways. It can be concatenated, where data is simply appended along either axis without regard to relationships in the data. Alternatively, it can be combined using relationships in the data via a pandas capability referred to as merging, which provides join operations similar to those in relational databases.
Reshaping - There are three primary means of reshaping the data: pivoting, stacking and unstacking, and melting, each of which is demonstrated later in this section.
In [1]:
# import pandas, numpy and datetime
import numpy as np
import pandas as pd
import datetime
# set some pandas options for controlling output
pd.set_option('display.notebook_repr_html',False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)
In [2]:
# two series objects to concatenate
s1 = pd.Series(np.arange(0,3))
s2 = pd.Series(np.arange(5,8))
s1
Out[2]:
In [3]:
s2
Out[3]:
In [4]:
# concatenate them
pd.concat([s1,s2])
Out[4]:
In [5]:
# dataframe objects can also be concatenated
# create two dataframe objects to concatenate
# using the same index labels and column names
# but different values
df1 = pd.DataFrame(np.arange(9).reshape(3,3), columns=['a','b','c'])
df2 = pd.DataFrame(np.arange(9,18).reshape(3,3),columns=['a','b','c'])
pd.concat([df1,df2])
Out[5]:
In [6]:
# demonstrate concatenating two dataframe objects
# with different columns
df1 = pd.DataFrame(np.arange(9).reshape(3,3), columns=['a','b','c'])
df2 = pd.DataFrame(np.arange(9,18).reshape(3,3),columns=['a','c','d'])
pd.concat([df1,df2])
Out[6]:
In [9]:
# concat the two objects, but create an index using the given keys
c = pd.concat([df1,df2],keys=['df1','df2'])
c
Out[9]:
In [10]:
# we can extract data originating from the first or second source dataframe
c.loc['df2']
Out[10]:
The pd.concat() function also allows you to specify the axis on which to apply the concatenation.
In [11]:
# concat df1 and df2 along columns
# aligns on row labels, has duplicate columns
pd.concat([df1,df2],axis=1)
Out[11]:
In [12]:
# a new dataframe to concatenate with df1
# it shares one row label (2) with df1 and
# one common column (a); columns b and c exist
# only in df1, and column d only in df3
df3 = pd.DataFrame(np.arange(20,26).reshape(3,2),columns=['a','d'],index=[2,3,4])
df3
Out[12]:
In [13]:
# concat them; alignment is along row labels
# columns are taken first from df1 and then df3, with duplicates
# NaN is filled in where a column does not exist in the source
pd.concat([df1,df3],axis=1)
Out[13]:
A concatenation of two or more DataFrame objects actually performs an outer join along the index labels on the axis opposite to the one specified. The result is therefore similar to having performed a union of those index labels, with the data then filled in based on the alignment of the labels to the source objects.
The type of join can be changed to an inner join by specifying join='inner' as a parameter. The inner join then logically performs an intersection of the labels instead of a union.
In [14]:
# do an inner join instead of outer
# results in one row
pd.concat([df1,df3],axis=1,join='inner')
Out[14]:
In [15]:
# add keys to the columns
df = pd.concat([df1,df2],axis=1,keys=['df1','df2'])
df
Out[15]:
In [16]:
# retrieve the data that originated from the
# DataFrame with key = df2
df.loc[:,'df2']
Out[16]:
In [17]:
# .append() performs a concatenation along axis=0
# duplicate row index labels can result
df1.append(df2)
Out[17]:
In [18]:
# remove duplicates in the result index by ignoring the
# index labels in the source DataFrame objects
df1.append(df2,ignore_index=True)
Out[18]:
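Note that .append() was deprecated in pandas 1.4 and removed in pandas 2.0; the same results can be obtained with pd.concat():
# equivalent of df1.append(df2) in modern pandas
pd.concat([df1, df2])
# equivalent of df1.append(df2, ignore_index=True)
pd.concat([df1, df2], ignore_index=True)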
pandas allows the merging of pandas objects with database-like join operations, using the pd.merge() function and the .merge() method of a DataFrame object. These joins are high performance and are performed in memory. A merge combines the data of two pandas objects by finding matching values in one or more columns or in the row indices.
In [24]:
# customer data
customers = {'CustomerID':[10,11],'Name':['Mike','Marcia'],'Address':['Address for Mike','Address for Marcia']}
customers = pd.DataFrame(customers)
customers
Out[24]:
In [25]:
# and these are the orders made by our customers
# they are related to customers by CustomerID
orders = {'CustomerID':[10,11,10],'OrderDate':[datetime.date(2014,12,1),datetime.date(2014,12,1),datetime.date(2014,12,1)]}
orders = pd.DataFrame(orders)
orders
Out[25]:
In [26]:
# merge customers and orders so we can ship the items
customers.merge(orders)
Out[26]:
pandas has done something almost magical for us here by accomplishing this with such a simple piece of code. It recognized that our customers and orders objects both have a column named CustomerID. With this knowledge, it used the common values found in that column of both DataFrame objects to relate the data and form the merged result based on inner join semantics.
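When the key columns do not share a name, they can be named explicitly with left_on and right_on. A small sketch (the renamed frame here is hypothetical, not part of the original example):
# hypothetical frame whose key column has a different name
orders_alt = orders.rename(columns={'CustomerID': 'CustID'})
# name the join column on each side explicitly
customers.merge(orders_alt, left_on='CustomerID', right_on='CustID')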
In [27]:
# sample data
left_data = {'Key1':['a','b','c'],'Key2':['x','y','z'],'lval1':[0,1,2]}
right_data = {'Key1':['a','b','c'],'Key2':['x','a','z'],'rval1':[6,7,8]}
left = pd.DataFrame(left_data,index=[0,1,2])
right = pd.DataFrame(right_data,index=[1,2,3])
left
Out[27]:
In [28]:
right
Out[28]:
In [29]:
# demonstrate merge without specifying columns to merge
# this will implicitly merge on all common columns
left.merge(right)
Out[29]:
In [31]:
# demonstrate merge using an explicit column
# the column passed to on must exist in both DataFrame objects
left.merge(right,on='Key1')
Out[31]:
In [32]:
# merge explicitly using two columns
left.merge(right,on=['Key1','Key2'])
Out[32]:
In [33]:
# join on the row indices of both DataFrame objects
pd.merge(left,right,left_index=True,right_index=True)
Out[33]:
This has identified that the index labels in common are 1 and 2, so the resulting DataFrame has two rows with those labels in its index. pandas then creates a column in the result for every column in both objects and copies the values across.
As both DataFrame objects had columns with identical names (Key1 and Key2), those columns in the result have the _x and _y suffixes appended to identify the DataFrame they originated from.
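If the default _x and _y suffixes are not descriptive enough, they can be overridden with the suffixes parameter; a quick sketch using the left and right frames defined above:
# override the default _x/_y suffixes on overlapping columns
pd.merge(left, right, left_index=True, right_index=True,
         suffixes=('_left', '_right'))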
The default type of join performed by pd.merge() is an inner join. A different join can be selected with the how parameter of the pd.merge() function (or of the .merge() method). The valid options are 'left', 'right', 'outer', and 'inner'.
In [34]:
# outer join, merges all matched data
# and fills unmatched items with NaN
left.merge(right,how='outer')
Out[34]:
In [35]:
# left join keeps all rows from the left dataframe,
# with NaN filled in for unmatched items from right
# rows with labels 0 and 2 match on Key1 and Key2;
# the row with label 1 exists only in left
left.merge(right,how="left")
Out[35]:
In [36]:
# right join keeps all rows from the right dataframe,
# with NaN filled in for unmatched items from left
# the rows matching on Key1 and Key2 ((a, x) and (c, z))
# carry values from both sides; (b, a) exists only in right
left.merge(right, how='right')
Out[36]:
In [37]:
# join left with right on their row indices
# (the default method for .join() is a left join)
# since these DataFrame objects have duplicate column names
# we must specify lsuffix and rsuffix
left.join(right,lsuffix='_left',rsuffix='_right')
Out[37]:
In [38]:
# join left with right with an inner join
left.join(right, lsuffix='_left',rsuffix='_right',how='inner')
Out[38]:
Data is often stored in a stacked format, also referred to as record format; this is common in databases, CSV files, and Excel spreadsheets. In a stacked format, the data is often not normalized and has repeated values in many columns, or values that should logically exist in other tables.
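If you do not have the accel.csv file read in the next cell, a stand-in can be constructed inline. This is a sketch with made-up readings; it only assumes the file has the interval, axis, and reading columns used in the rest of this section:
# hypothetical stand-in for accel.csv with the same columns
sensor_readings = pd.DataFrame({
    'interval': [0, 0, 0, 1, 1, 1],
    'axis': ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'reading': [0.0, 0.5, 1.0, 0.1, 0.4, 0.9]})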
In [39]:
# read in accelerometer data
sensor_readings = pd.read_csv('../../data/accel.csv')
sensor_readings
Out[39]:
In [40]:
# extract X-axis readings
sensor_readings[sensor_readings['axis'] == 'X']
Out[40]:
In [42]:
# pivot the data: interval becomes the index, the distinct
# values of the axis column become the columns, and the
# readings supply the values
sensor_readings.pivot(index='interval',columns='axis',values='reading')
Out[42]:
This has taken all of the distinct values from the axis column and pivoted them into columns of the new DataFrame, filling in the values for those new columns from the appropriate rows and columns of the original DataFrame.
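One caveat worth noting: .pivot() requires each index/column pair to be unique and raises an error on duplicates. pd.pivot_table() with an aggregation function is the usual workaround; a sketch that aggregates any duplicate readings by their mean:
# pivot_table tolerates duplicate (interval, axis) pairs
# by aggregating them, here with the mean
sensor_readings.pivot_table(index='interval', columns='axis',
                            values='reading', aggfunc='mean')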
Stacking pivots a level of column labels to the row index. Unstacking pivots a level of the row index into the column index.
One of the differences between stacking/unstacking and performing a pivot is that, unlike pivots, the stack and unstack functions are able to pivot specific levels of a hierarchical index. Also, where a pivot retains the same number of levels on an index, a stack or unstack always increases the number of levels on the index of one of the axes and decreases them on the other.
In [46]:
# simple DataFrame with one column
df = pd.DataFrame({'a':[1,2]}, index=['one','two'])
df
Out[46]:
Stacking will move one level of the columns index into a new level of the rows index. As our DataFrame only has one level, this collapses a DataFrame object into a Series object with a hierarchical row index:
In [45]:
# push the column to another level of the index
# the result is a Series where values are looked up through
# a multi-index
stacked1 = df.stack()
stacked1
Out[45]:
In [47]:
# look up the value at one / a via a tuple on the index
stacked1[('one','a')]
Out[47]:
In [48]:
# DataFrame with two columns
df = pd.DataFrame({'a':[1,2],'b':[3,4]},index=['one','two'])
df
Out[48]:
In [49]:
# push the two columns into a single level of index
stacked2 = df.stack()
stacked2
Out[49]:
In [50]:
# lookup value with index two / b
stacked2[('two','b')]
Out[50]:
In [51]:
# make two copies of the sensor data, one for each user
user1 = sensor_readings.copy()
user2 = sensor_readings.copy()
# add names to the two copies
user1['who'] = 'Mike'
user2['who'] = 'Mikael'
# for demonstration, let's scale user2's readings
user2['reading'] *= 100
# and reorganize this to have a hierarchical row index
multi_user_sensor_data = pd.concat([user1,user2]).set_index(['who','interval','axis'])
multi_user_sensor_data
Out[51]:
In [52]:
# look up user data for Mike using just the index
multi_user_sensor_data.loc['Mike']
Out[52]:
In [53]:
# readings for all users and axes at interval 1
multi_user_sensor_data.xs(1,level='interval')
Out[53]:
Unstacking will move the last level of the row index into a new level of the columns index, resulting in the columns having a MultiIndex.
In [54]:
# unstack axis
multi_user_sensor_data.unstack()
Out[54]:
In [55]:
# unstack at level=0
multi_user_sensor_data.unstack(level=0)
Out[55]:
Multiple levels can be unstacked simultaneously by passing a list of the levels to .unstack(). Additionally, if the levels are named, they can be specified by name instead of location.
In [57]:
# unstack who and axis levels
unstacked = multi_user_sensor_data.unstack(['who','axis'])
unstacked
Out[57]:
In [58]:
# and we can of course stack what we have unstacked
# this re-stacks who
unstacked.stack(level='who')
Out[58]:
There are a couple of points to notice here: the re-stacked who level has become the innermost (last) level of the row index, rather than the outermost level it originally was, and the axis level remains in the columns, so the result is not structured identically to the original multi_user_sensor_data.
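If the original level ordering matters, it can be restored. A sketch using .reorder_levels() and .sort_index() on the re-stacked result:
# re-stack who and then move it back to the outermost level
restacked = unstacked.stack(level='who')
restacked.reorder_levels(['who', 'interval']).sort_index()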
Melting is a type of unpivoting, and is often referred to as changing a DataFrame object from wide format to long format.
Technically, it is the process of reshaping a DataFrame into a format where two new columns, named variable and value, are created by unpivoting the column labels into the variable column and then moving the data from those columns into the appropriate locations in the value column.
In [59]:
# we will demonstrate melting in this DataFrame
data = pd.DataFrame({'Name':['Mike','Mikael'],'Height':[6.1,5.9], 'Weight':[220,185]})
data
Out[59]:
In [60]:
# melt it, use Name as the id,
# Height and Weight columns as the variables
pd.melt(data,id_vars=['Name'],value_vars=['Height','Weight'])
Out[60]:
The data is now structured so that it is easy to extract the value for any combination of variable and Name. Additionally, when in this format, it is easier to add a new variable and measurement, as the data can simply be added as a new row instead of requiring a change to the structure of the DataFrame by adding a new column.
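As a quick sketch of the round trip, the melted result can be pivoted back into wide format:
# pivot the long-format data back to wide format
melted = pd.melt(data, id_vars=['Name'], value_vars=['Height','Weight'])
melted.pivot(index='Name', columns='variable', values='value')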
In [62]:
# stacked scalar access can be a lot faster than
# column access
# time the different methods
import timeit
# time scalar lookup via the stacked Series multi-index
r1 = timeit.timeit(lambda: stacked1.loc[('one','a')], number=10000)
# time chained row-then-column access on the DataFrame
r2 = timeit.timeit(lambda: df.loc['one']['a'], number=10000)
# time positional access with .iloc
r3 = timeit.timeit(lambda: df.iloc[1,0], number=10000)
# and the results are
r1,r2,r3
Out[62]: